Problem Statement¶
Business Context¶
Understanding customer personality and behavior is pivotal for businesses to enhance customer satisfaction and increase revenue. Segmentation based on a customer's personality, demographics, and purchasing behavior allows companies to create tailored marketing campaigns, improve customer retention, and optimize product offerings.
A leading retail company with a rapidly growing customer base seeks to gain deeper insights into their customers' profiles. The company recognizes that understanding customer personalities, lifestyles, and purchasing habits can unlock significant opportunities for personalizing marketing strategies and creating loyalty programs. These insights can help address critical business challenges, such as improving the effectiveness of marketing campaigns, identifying high-value customer groups, and fostering long-term relationships with customers.
With the competition intensifying in the retail space, moving away from generic strategies to more targeted and personalized approaches is essential for sustaining a competitive edge.
Objective¶
In an effort to optimize marketing efficiency and enhance customer experience, the company has embarked on a mission to identify distinct customer segments. By understanding the characteristics, preferences, and behaviors of each group, the company aims to:
- Develop personalized marketing campaigns to increase conversion rates.
- Create effective retention strategies for high-value customers.
- Optimize resource allocation, such as inventory management, pricing strategies, and store layouts.
As a data scientist tasked with this project, your responsibility is to analyze the given customer data, apply machine learning techniques to segment the customer base, and provide actionable insights into the characteristics of each segment.
Data Dictionary¶
The dataset includes historical data on customer demographics, personality traits, and purchasing behaviors. Key attributes are:
Customer Information
- ID: Unique identifier for each customer.
- Year_Birth: Customer's year of birth.
- Education: Education level of the customer.
- Marital_Status: Marital status of the customer.
- Income: Yearly household income (in dollars).
- Kidhome: Number of children in the household.
- Teenhome: Number of teenagers in the household.
- Dt_Customer: Date when the customer enrolled with the company.
- Recency: Number of days since the customer’s last purchase.
- Complain: Whether the customer complained in the last 2 years (1 for yes, 0 for no).
Spending Information (Last 2 Years)
- MntWines: Amount spent on wine.
- MntFruits: Amount spent on fruits.
- MntMeatProducts: Amount spent on meat.
- MntFishProducts: Amount spent on fish.
- MntSweetProducts: Amount spent on sweets.
- MntGoldProds: Amount spent on gold products.
Purchase and Campaign Interaction
- NumDealsPurchases: Number of purchases made using a discount.
- AcceptedCmp1: Response to the 1st campaign (1 for yes, 0 for no).
- AcceptedCmp2: Response to the 2nd campaign (1 for yes, 0 for no).
- AcceptedCmp3: Response to the 3rd campaign (1 for yes, 0 for no).
- AcceptedCmp4: Response to the 4th campaign (1 for yes, 0 for no).
- AcceptedCmp5: Response to the 5th campaign (1 for yes, 0 for no).
- Response: Response to the last campaign (1 for yes, 0 for no).
Shopping Behavior
- NumWebPurchases: Number of purchases made through the company’s website.
- NumCatalogPurchases: Number of purchases made using catalogs.
- NumStorePurchases: Number of purchases made directly in stores.
- NumWebVisitsMonth: Number of visits to the company’s website in the last month.
Problem Definition¶
The company operates in retail and wants to segment its ever-growing customer base based on personality traits, demographics, and buying behavior, with a view to more effective marketing and a better customer experience. Traditional one-size-fits-all marketing techniques no longer suffice; data-driven approaches are needed to personalize customer interactions. The company will use machine learning techniques to identify distinct groups of customers for targeted campaigns, better retention strategies, and efficient resource allocation. The ultimate goal is to enhance customer satisfaction, increase revenue, and thereby strengthen competitiveness in a changing retail environment.
Importing necessary libraries¶
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import cdist, pdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# to suppress warnings
import warnings
warnings.filterwarnings("ignore")
Loading the data¶
# mounting Google Drive to access the data file (skip or comment out this cell if not using Google Colab)
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# loading data into a pandas dataframe
data = pd.read_csv("/content/drive/My Drive/marketing_campaign.csv", sep="\t")
Data Overview¶
# Code to check the first 5 rows of the dataset
data.head()
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | 8 | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | 1 | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | 8 | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | 2 | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | 5 | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
# Code to check the last 5 rows of the dataset
data.tail()
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2235 | 10870 | 1967 | Graduation | Married | 61223.0 | 0 | 1 | 13-06-2013 | 46 | 709 | 43 | 182 | 42 | 118 | 247 | 2 | 9 | 3 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2236 | 4001 | 1946 | PhD | Together | 64014.0 | 2 | 1 | 10-06-2014 | 56 | 406 | 0 | 30 | 0 | 0 | 8 | 7 | 8 | 2 | 5 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 11 | 0 |
| 2237 | 7270 | 1981 | Graduation | Divorced | 56981.0 | 0 | 0 | 25-01-2014 | 91 | 908 | 48 | 217 | 32 | 12 | 24 | 1 | 2 | 3 | 13 | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2238 | 8235 | 1956 | Master | Together | 69245.0 | 0 | 1 | 24-01-2014 | 8 | 428 | 30 | 214 | 80 | 30 | 61 | 2 | 6 | 5 | 10 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2239 | 9405 | 1954 | PhD | Married | 52869.0 | 1 | 1 | 15-10-2012 | 40 | 84 | 3 | 61 | 2 | 1 | 21 | 3 | 3 | 1 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
Observations
The data has been loaded correctly. We can proceed to perform further analysis on the dataset.
# keeping a backup copy of the raw data before any modifications
raw_data = data.copy()
# Code to check the shape of the dataset
num_rows, num_cols = data.shape
# Adding narrative to the output
print(f"The dataset has {num_rows} rows and {num_cols} columns.")
print(f"This means there are {num_rows} observations and {num_cols} features in the data.")
The dataset has 2240 rows and 29 columns. This means there are 2240 observations and 29 features in the data.
Calculating the age of the customer using the "Year Birth"¶
from datetime import datetime
# Get the current year
current_year = datetime.now().year
# Calculate the age of the customer
data['Age'] = current_year - data['Year_Birth']
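Because Age is derived from the current run date, a few Year_Birth values (e.g., 1893 in this data) produce implausible ages. A minimal sanity-check sketch with toy data — the fixed `current_year` and the 100-year cutoff are assumptions for illustration, not from the source:

```python
import pandas as pd

# toy Year_Birth values, including an implausible 1893 like the one in the data
df = pd.DataFrame({"Year_Birth": [1957, 1893, 1984]})

current_year = 2024  # fixed here for reproducibility; the notebook uses datetime.now().year
df["Age"] = current_year - df["Year_Birth"]

# flag ages above a plausibility threshold (100 is an assumed cutoff)
implausible = df[df["Age"] > 100]
print(implausible)
```

Rows flagged this way could be inspected or capped before clustering, since extreme ages distort scaled distances.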
Question 1: What are the data types of all the columns?¶
# Code to check the data types of the columns in the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 30 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   ID                   2240 non-null   int64
 1   Year_Birth           2240 non-null   int64
 2   Education            2240 non-null   object
 3   Marital_Status       2240 non-null   object
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64
 6   Teenhome             2240 non-null   int64
 7   Dt_Customer          2240 non-null   object
 8   Recency              2240 non-null   int64
 9   MntWines             2240 non-null   int64
 10  MntFruits            2240 non-null   int64
 11  MntMeatProducts      2240 non-null   int64
 12  MntFishProducts      2240 non-null   int64
 13  MntSweetProducts     2240 non-null   int64
 14  MntGoldProds         2240 non-null   int64
 15  NumDealsPurchases    2240 non-null   int64
 16  NumWebPurchases      2240 non-null   int64
 17  NumCatalogPurchases  2240 non-null   int64
 18  NumStorePurchases    2240 non-null   int64
 19  NumWebVisitsMonth    2240 non-null   int64
 20  AcceptedCmp3         2240 non-null   int64
 21  AcceptedCmp4         2240 non-null   int64
 22  AcceptedCmp5         2240 non-null   int64
 23  AcceptedCmp1         2240 non-null   int64
 24  AcceptedCmp2         2240 non-null   int64
 25  Complain             2240 non-null   int64
 26  Z_CostContact        2240 non-null   int64
 27  Z_Revenue            2240 non-null   int64
 28  Response             2240 non-null   int64
 29  Age                  2240 non-null   int64
dtypes: float64(1), int64(26), object(3)
memory usage: 525.1+ KB
Observations:¶
- The dataset has 1 float column, 26 integer columns, and 3 object (string) columns, with a memory usage of about 525.1 KB.
- The Income column has only 2216 non-null entries out of 2240, so it contains missing values.
Question 2: Check the statistical summary of the data. What is the average household income?¶
# Code to check the statistical summary of the dataset
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ID | 2240.0 | 5592.159821 | 3246.662198 | 0.0 | 2828.25 | 5458.5 | 8427.75 | 11191.0 |
| Year_Birth | 2240.0 | 1968.805804 | 11.984069 | 1893.0 | 1959.00 | 1970.0 | 1977.00 | 1996.0 |
| Income | 2216.0 | 52247.251354 | 25173.076661 | 1730.0 | 35303.00 | 51381.5 | 68522.00 | 666666.0 |
| Kidhome | 2240.0 | 0.444196 | 0.538398 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| Teenhome | 2240.0 | 0.506250 | 0.544538 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| Recency | 2240.0 | 49.109375 | 28.962453 | 0.0 | 24.00 | 49.0 | 74.00 | 99.0 |
| MntWines | 2240.0 | 303.935714 | 336.597393 | 0.0 | 23.75 | 173.5 | 504.25 | 1493.0 |
| MntFruits | 2240.0 | 26.302232 | 39.773434 | 0.0 | 1.00 | 8.0 | 33.00 | 199.0 |
| MntMeatProducts | 2240.0 | 166.950000 | 225.715373 | 0.0 | 16.00 | 67.0 | 232.00 | 1725.0 |
| MntFishProducts | 2240.0 | 37.525446 | 54.628979 | 0.0 | 3.00 | 12.0 | 50.00 | 259.0 |
| MntSweetProducts | 2240.0 | 27.062946 | 41.280498 | 0.0 | 1.00 | 8.0 | 33.00 | 263.0 |
| MntGoldProds | 2240.0 | 44.021875 | 52.167439 | 0.0 | 9.00 | 24.0 | 56.00 | 362.0 |
| NumDealsPurchases | 2240.0 | 2.325000 | 1.932238 | 0.0 | 1.00 | 2.0 | 3.00 | 15.0 |
| NumWebPurchases | 2240.0 | 4.084821 | 2.778714 | 0.0 | 2.00 | 4.0 | 6.00 | 27.0 |
| NumCatalogPurchases | 2240.0 | 2.662054 | 2.923101 | 0.0 | 0.00 | 2.0 | 4.00 | 28.0 |
| NumStorePurchases | 2240.0 | 5.790179 | 3.250958 | 0.0 | 3.00 | 5.0 | 8.00 | 13.0 |
| NumWebVisitsMonth | 2240.0 | 5.316518 | 2.426645 | 0.0 | 3.00 | 6.0 | 7.00 | 20.0 |
| AcceptedCmp3 | 2240.0 | 0.072768 | 0.259813 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp4 | 2240.0 | 0.074554 | 0.262728 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp5 | 2240.0 | 0.072768 | 0.259813 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp1 | 2240.0 | 0.064286 | 0.245316 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp2 | 2240.0 | 0.013393 | 0.114976 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Complain | 2240.0 | 0.009375 | 0.096391 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Z_CostContact | 2240.0 | 3.000000 | 0.000000 | 3.0 | 3.00 | 3.0 | 3.00 | 3.0 |
| Z_Revenue | 2240.0 | 11.000000 | 0.000000 | 11.0 | 11.00 | 11.0 | 11.00 | 11.0 |
| Response | 2240.0 | 0.149107 | 0.356274 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Age | 2240.0 | 56.194196 | 11.984069 | 29.0 | 48.00 | 55.0 | 66.00 | 132.0 |
Observations:¶
The average household income is approximately $52,247.25.
Question 3: Are there any missing values in the data? If yes, treat them using an appropriate method¶
# Code to check for missing values
data.isnull().sum()
| 0 | |
|---|---|
| ID | 0 |
| Year_Birth | 0 |
| Education | 0 |
| Marital_Status | 0 |
| Income | 24 |
| Kidhome | 0 |
| Teenhome | 0 |
| Dt_Customer | 0 |
| Recency | 0 |
| MntWines | 0 |
| MntFruits | 0 |
| MntMeatProducts | 0 |
| MntFishProducts | 0 |
| MntSweetProducts | 0 |
| MntGoldProds | 0 |
| NumDealsPurchases | 0 |
| NumWebPurchases | 0 |
| NumCatalogPurchases | 0 |
| NumStorePurchases | 0 |
| NumWebVisitsMonth | 0 |
| AcceptedCmp3 | 0 |
| AcceptedCmp4 | 0 |
| AcceptedCmp5 | 0 |
| AcceptedCmp1 | 0 |
| AcceptedCmp2 | 0 |
| Complain | 0 |
| Z_CostContact | 0 |
| Z_Revenue | 0 |
| Response | 0 |
| Age | 0 |
Observations:¶
- The Income column has 24 missing values.
We will fill the missing values in the Income column by imputing the median, which is robust to the skew and outliers seen in Income.
# imputing the missing Income values with the median
data["Income"] = data["Income"].fillna(data["Income"].median())
# checking for missing values after treatment
data.isnull().sum()
| 0 | |
|---|---|
| ID | 0 |
| Year_Birth | 0 |
| Education | 0 |
| Marital_Status | 0 |
| Income | 0 |
| Kidhome | 0 |
| Teenhome | 0 |
| Dt_Customer | 0 |
| Recency | 0 |
| MntWines | 0 |
| MntFruits | 0 |
| MntMeatProducts | 0 |
| MntFishProducts | 0 |
| MntSweetProducts | 0 |
| MntGoldProds | 0 |
| NumDealsPurchases | 0 |
| NumWebPurchases | 0 |
| NumCatalogPurchases | 0 |
| NumStorePurchases | 0 |
| NumWebVisitsMonth | 0 |
| AcceptedCmp3 | 0 |
| AcceptedCmp4 | 0 |
| AcceptedCmp5 | 0 |
| AcceptedCmp1 | 0 |
| AcceptedCmp2 | 0 |
| Complain | 0 |
| Z_CostContact | 0 |
| Z_Revenue | 0 |
| Response | 0 |
| Age | 0 |
Observations
The missing values in the Income column have now been treated.
Question 4: Are there any duplicates in the data?¶
# Code to check for duplicates
data.duplicated().sum()
0
Observations:¶
- The dataset does not have any duplicates
Dropping columns which are irrelevant to our analysis.¶
columns_to_drop = ['Dt_Customer','Year_Birth','ID','AcceptedCmp1', 'Z_CostContact', 'Z_Revenue', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Education', 'Marital_Status']
data.drop(columns=columns_to_drop, inplace=True)
Exploratory Data Analysis¶
Univariate Analysis¶
Question 5: Explore all the variables and provide observations on their distributions (histograms and boxplots)¶
# Code to check the shape of the dataset
num_rows, num_cols = data.shape
# Adding narrative to the output
print(f"The dataset has {num_rows} rows and {num_cols} columns.")
print(f"This means there are {num_rows} observations and {num_cols} features in the data.")
The dataset has 2240 rows and 18 columns. This means there are 2240 observations and 18 features in the data.
Plotting the histogram of each column.¶
# defining the figure size
plt.figure(figsize=(12, 10))
# plotting the histogram (with KDE) for each feature
for i, feature in enumerate(data.columns):  # iterating through each column
    plt.subplot(6, 3, i + 1)  # assign a subplot in the main plot
    sns.histplot(data=data, x=feature, kde=True)  # plot the histogram
plt.tight_layout()  # to add spacing between plots
Observations
Most of the distributions are right-skewed. A greater amount of the data is concentrated on the left, at the lower end of the spectrum.
Plotting the boxplot of each column.¶
# defining the figure size
plt.figure(figsize=(12, 10))
# plotting the boxplot for each numerical feature
for i, feature in enumerate(data.columns):  # iterating through each column
    plt.subplot(6, 3, i + 1)  # assign a subplot in the main plot
    sns.boxplot(data=data, x=feature)  # plot the boxplot
plt.tight_layout()  # to add spacing between plots
Observations
The above plots summarize the distribution of both discrete and continuous variables in the dataset.
Distribution of some features are skewed while others are somewhat symmetrical.
Income: The histogram is right-skewed; this is an indication that the majority of customers have lower incomes.
Kidhome and Teenhome: most of the customers have zero kids or teenagers staying with them.
Recency: This is a right-skewed distribution. There are lots of customers that have not made purchases recently.
MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds: Spending variables for wine and food products. All show varying degrees of right-skewness; the majority of customers spend relatively little on these items.
Complain: The distribution is highly right-skewed; only a few customers have complained.
Age: This is right-skewed, indicating most customers fall toward the younger end of the range, with a long tail of older ages.
Some outliers can be observed in the data. These will not be treated, as they form a genuine part of the dataset.
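The right-skew described above can be quantified with pandas' `skew()`; positive values indicate a long right tail. A minimal sketch on a toy series (the values are illustrative, not drawn from the dataset):

```python
import pandas as pd

# toy spending values chosen to mimic the right-skew seen in the Mnt* columns
spend = pd.Series([0, 5, 8, 12, 20, 35, 60, 150, 400])

# sample skewness; positive means a long right tail
skewness = spend.skew()
print(f"Skewness: {skewness:.2f}")
```

In the notebook this could be applied column-wise, e.g. `data.skew()`, to rank features by how skewed they are.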
Bivariate Analysis¶
Question 6: Perform multivariate analysis to explore the relationships between the variables.¶
Let's check for correlations.
# Code to check for correlation
plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(), annot=True)
plt.show()
Observations:
Several groups of spending categories (MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds) show strong positive correlations amongst themselves. This suggests customers who spend more on one type of product tend to spend more on others.
Kidhome (number of children in the household) shows moderate negative correlations with Income and most spending categories. This implies that families with more children tend to have lower incomes and spend less on these products.
Income plays a role in spending behavior, particularly for certain product types.
Household size (Kidhome) is negatively correlated with income, though this is an association rather than a causal effect.
Many of the remaining variable pairs show little to no correlation.
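Reading the strongest pairs off a large heatmap is error-prone; the correlation matrix can also be ranked programmatically. A sketch on synthetic data (column names borrowed from the dataset for illustration only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "MntWines": base + rng.normal(scale=0.1, size=200),
    "MntMeatProducts": base + rng.normal(scale=0.1, size=200),
    "Recency": rng.normal(size=200),
})

corr = df.corr().abs()
# keep only the upper triangle so each pair appears exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)
print(pairs.head())
```

Run on the real `data.corr()`, this would list the spending-variable pairs noted above at the top.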
Let's check for pairplots.
sns.pairplot(data=data, diag_kind="kde")
plt.show()
Data Preprocessing¶
Scaling¶
- Let's scale the data before we proceed with clustering.
# scaling the data before clustering
scaler = StandardScaler()
subset = data.copy()
subset_scaled = scaler.fit_transform(subset)
# creating a dataframe of the scaled data
subset_scaled_data = pd.DataFrame(subset_scaled, columns=subset.columns)
K-means Clustering¶
k_means_data = subset_scaled_data.copy() # Code will be used later in cluster profiling
Question 7: Select the appropriate number of clusters using the elbow plot. What do you think is the appropriate number of clusters?¶
clusters = range(2, 11)
wcss_values = []
for k in clusters:
    model = KMeans(n_clusters=k, random_state=1)  # initialize the k-means model with n_clusters=k
    model.fit(subset_scaled_data)  # fit the model on the scaled data
    wcss = model.inertia_  # within-cluster sum of squares
    wcss_values.append(wcss)
    print("Number of Clusters:", k, "\tWCSS:", wcss)
plt.plot(clusters, wcss_values, "bo-")
plt.xlabel("k")
plt.ylabel("WCSS")
plt.title("Selecting k with the Elbow Method", fontsize=20)
plt.show()
Number of Clusters: 2 	WCSS: 29497.59180741321
Number of Clusters: 3 	WCSS: 26230.895399686044
Number of Clusters: 4 	WCSS: 24836.807962396113
Number of Clusters: 5 	WCSS: 23798.198564507387
Number of Clusters: 6 	WCSS: 22905.78476209626
Number of Clusters: 7 	WCSS: 21972.671605951775
Number of Clusters: 8 	WCSS: 19867.06721740295
Number of Clusters: 9 	WCSS: 19201.858363788044
Number of Clusters: 10 	WCSS: 18790.32967985496
Observations:¶
- The appropriate number of clusters appears to be between 2 and 3.
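A numeric heuristic can complement eyeballing the elbow: pick the k where the WCSS curve bends most sharply, i.e. the largest second difference. This is a sketch on synthetic blobs with a known structure, not a substitute for inspecting the plot:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic data with three well-separated clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

ks = list(range(2, 9))
wcss = [KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_ for k in ks]

# the second difference approximates the curvature of the WCSS curve;
# its maximum marks the sharpest bend (the elbow)
second_diff = np.diff(wcss, n=2)
elbow_k = ks[int(np.argmax(second_diff)) + 1]
print("Elbow at k =", elbow_k)
```

The `KElbowVisualizer` imported from yellowbrick earlier implements a similar knee-detection idea with a single call.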
Question 8: Finalize the appropriate number of clusters by checking the silhouette score as well. Is the answer different from the elbow plot?¶
sil_score = []  # list to store silhouette scores
cluster_list = range(2, 10)
for n_clusters in cluster_list:
    # initialize the k-means model with the current value of n_clusters
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)
    # fit the model to the scaled data and get cluster predictions
    preds = clusterer.fit_predict(k_means_data)
    # compute the silhouette score for the predictions
    score = silhouette_score(k_means_data, preds)
    sil_score.append(score)
    print("For n_clusters = {}, the silhouette score is {}".format(n_clusters, score))
For n_clusters = 2, the silhouette score is 0.28417340291067655
For n_clusters = 3, the silhouette score is 0.21188794461530194
For n_clusters = 4, the silhouette score is 0.14437407459098453
For n_clusters = 5, the silhouette score is 0.1389441529164155
For n_clusters = 6, the silhouette score is 0.1441860919328272
For n_clusters = 7, the silhouette score is 0.1466393276809718
For n_clusters = 8, the silhouette score is 0.15470127342280662
For n_clusters = 9, the silhouette score is 0.14604007088599213
Observations
n_clusters=2 has the highest silhouette score of 0.28.
# Find the optimal number of clusters
optimal_n_clusters = cluster_list[sil_score.index(max(sil_score))]
print("\nOptimal number of clusters:", optimal_n_clusters)
Optimal number of clusters: 2
# empty dictionary to store the silhouette score for each value of k
sc = {}
# iterate over a range of k values, fit the scaled data, and store the silhouette score for each k
for k in range(2, 10):
    kmeans = KMeans(n_clusters=k, random_state=1).fit(subset_scaled_data)
    labels = kmeans.predict(subset_scaled_data)
    sc[k] = silhouette_score(subset_scaled_data, labels)
# silhouette score plot
plt.figure()
plt.plot(list(sc.keys()), list(sc.values()), "bx-")
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Score")
plt.show()
Observations
k=2 is Optimal: The highest silhouette score is observed at k=2. This suggests that dividing the data into two clusters results in the best separation and cohesion within the clusters.
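As a sanity check of the silhouette criterion itself: on synthetic data with a known cluster count, the score should peak at (or near) the true k. A sketch with toy blobs, not the customer data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic data with four compact, mostly separated clusters
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("Best k by silhouette:", best_k)
```

The `SilhouetteVisualizer` imported from yellowbrick earlier additionally plots per-sample silhouette widths, which helps spot thin or negative-score clusters.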
# calculating summary statistics of the original data for each cluster label
# note: kmeans here is the last model fitted in the silhouette loop above (k=9),
# so this profiles nine groups rather than the two suggested by the silhouette score
data['Labels'] = kmeans.labels_
mean = data.groupby('Labels').mean()
median = data.groupby('Labels').median()
df_kmeans = pd.concat([mean, median], axis=0)
# Get the number of unique cluster labels
n_clusters = data['Labels'].nunique()
# Dynamically create index labels based on the number of clusters
index_labels = []
for i in range(n_clusters):
    index_labels.extend([f'group_{i} Mean', f'group_{i} Median'])
# Set the index of the DataFrame
df_kmeans.index = index_labels
df_kmeans.T
| group_0 Mean | group_0 Median | group_1 Mean | group_1 Median | group_2 Mean | group_2 Median | group_3 Mean | group_3 Median | group_4 Mean | group_4 Median | group_5 Mean | group_5 Median | group_6 Mean | group_6 Median | group_7 Mean | group_7 Median | group_8 Mean | group_8 Median | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Income | 67990.975248 | 42629.492152 | 76130.079787 | 29298.816981 | 59054.925127 | 30192.558065 | 45242.285714 | 80062.090909 | 49837.863333 | 69084.0 | 42101.0 | 76542.5 | 29791.0 | 59537.5 | 29510.5 | 38998.0 | 77917.5 | 50637.5 |
| Kidhome | 0.148515 | 0.652466 | 0.026596 | 0.841509 | 0.040609 | 0.870968 | 0.666667 | 0.015152 | 0.946667 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| Teenhome | 0.495050 | 0.959641 | 0.202128 | 0.067925 | 0.979695 | 0.022581 | 0.523810 | 0.026515 | 0.926667 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Recency | 48.064356 | 50.742152 | 48.101064 | 22.105660 | 48.530457 | 70.961290 | 53.047619 | 50.787879 | 47.486667 | 47.5 | 50.0 | 48.0 | 19.0 | 49.0 | 74.0 | 49.0 | 55.0 | 51.0 |
| MntWines | 631.841584 | 56.513453 | 528.436170 | 28.433962 | 494.076142 | 34.032258 | 169.000000 | 630.219697 | 306.366667 | 587.5 | 35.0 | 493.5 | 12.0 | 457.5 | 14.0 | 34.0 | 586.5 | 241.5 |
| MntFruits | 55.321782 | 3.352018 | 117.829787 | 6.279245 | 18.667513 | 6.932258 | 24.190476 | 40.064394 | 12.280000 | 48.5 | 1.0 | 120.0 | 3.0 | 12.0 | 3.0 | 6.0 | 31.0 | 6.0 |
| MntMeatProducts | 268.386139 | 20.224215 | 463.351064 | 21.596226 | 127.385787 | 28.625806 | 112.476190 | 532.518939 | 105.933333 | 240.0 | 13.5 | 430.0 | 13.0 | 106.0 | 15.5 | 30.0 | 482.5 | 87.0 |
| MntFishProducts | 79.400990 | 5.356502 | 144.079787 | 8.675472 | 25.329949 | 9.829032 | 25.761905 | 74.678030 | 19.733333 | 69.0 | 2.0 | 150.0 | 4.0 | 16.0 | 6.0 | 7.0 | 63.0 | 10.0 |
| MntSweetProducts | 74.094059 | 3.704036 | 102.053191 | 5.852830 | 16.690355 | 6.800000 | 17.523810 | 44.700758 | 16.080000 | 63.5 | 1.0 | 102.5 | 3.0 | 11.0 | 4.0 | 4.0 | 35.0 | 6.0 |
| MntGoldProds | 80.668317 | 12.668161 | 100.595745 | 18.479245 | 59.880711 | 16.464516 | 27.476190 | 58.784091 | 53.746667 | 64.5 | 7.0 | 83.0 | 12.0 | 39.0 | 11.0 | 17.0 | 43.0 | 39.0 |
| NumDealsPurchases | 2.108911 | 2.053812 | 1.335106 | 1.773585 | 2.926396 | 1.841935 | 2.333333 | 1.174242 | 7.080000 | 2.0 | 2.0 | 1.0 | 1.0 | 3.0 | 1.5 | 2.0 | 1.0 | 6.0 |
| NumWebPurchases | 7.747525 | 1.997758 | 4.978723 | 2.052830 | 6.218274 | 2.296774 | 3.619048 | 4.193182 | 5.793333 | 8.0 | 2.0 | 5.0 | 2.0 | 6.0 | 2.0 | 3.0 | 4.0 | 6.0 |
| NumCatalogPurchases | 4.915842 | 0.594170 | 5.957447 | 0.528302 | 3.149746 | 0.496774 | 2.047619 | 6.352273 | 2.200000 | 4.0 | 0.0 | 6.0 | 0.0 | 3.0 | 0.0 | 1.0 | 6.0 | 2.0 |
| NumStorePurchases | 8.564356 | 3.459641 | 8.414894 | 2.992453 | 8.012690 | 3.209677 | 5.238095 | 8.234848 | 5.906667 | 9.0 | 3.0 | 8.0 | 3.0 | 8.0 | 3.0 | 3.0 | 8.0 | 6.0 |
| NumWebVisitsMonth | 5.272277 | 5.616592 | 2.462766 | 6.924528 | 5.355330 | 6.906452 | 5.809524 | 2.117424 | 7.393333 | 5.0 | 6.0 | 2.0 | 7.0 | 5.0 | 7.0 | 7.0 | 2.0 | 7.0 |
| Complain | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| Response | 0.331683 | 0.015695 | 0.196809 | 0.290566 | 0.068528 | 0.000000 | 0.142857 | 0.284091 | 0.273333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Age | 56.019802 | 61.937220 | 55.829787 | 47.203774 | 61.974619 | 47.090323 | 59.904762 | 57.204545 | 57.026667 | 55.0 | 61.0 | 54.0 | 47.0 | 62.0 | 47.0 | 61.0 | 58.0 | 55.0 |
Question 9: Do a final fit with the appropriate number of clusters. How much total time does it take for the model to fit the data?¶
%%time
# final fit with the chosen number of clusters (n_clusters=2)
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(k_means_data)
KMeans(n_clusters=2, random_state=0)
import time
start_time = time.time()
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(k_means_data)
end_time = time.time()
total_time = end_time - start_time
print(f"Total time to fit the model: {total_time:.4f} seconds")
Total time to fit the model: 0.0137 seconds
# creating a copy of the original data
data1 = data.copy()
# adding kmeans cluster labels to the original and scaled dataframes
k_means_data["K_means_segments"] = kmeans.labels_
data1["K_means_segments"] = kmeans.labels_
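With the labels attached, each segment can be profiled with a simple groupby. A sketch with toy values (in the notebook this would run on `data1` with its real columns):

```python
import pandas as pd

# toy stand-in for data1 with K-means labels attached
data1 = pd.DataFrame({
    "Income": [30000, 32000, 80000, 85000],
    "MntWines": [20, 35, 600, 700],
    "K_means_segments": [0, 0, 1, 1],
})

# average income and wine spend per segment
profile = data1.groupby("K_means_segments")[["Income", "MntWines"]].mean()
print(profile)
```

Comparing segment means against the overall mean is the basis for naming and targeting each group.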
Hierarchical Clustering¶
hc_data = subset_scaled_data.copy()
Question 10: Calculate the cophenetic correlation for every combination of distance metric and linkage method. Which combination has the highest cophenetic correlation?¶
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(hc_data, metric=dm, method=lm)  # linkage matrix for this metric/method pair
        c, coph_dists = cophenet(Z, pdist(hc_data))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm
Cophenetic correlation for Euclidean distance and single linkage is 0.7533004792313172.
Cophenetic correlation for Euclidean distance and complete linkage is 0.73612253810357.
Cophenetic correlation for Euclidean distance and average linkage is 0.8541236643208112.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.8108803552015535.
Cophenetic correlation for Chebyshev distance and single linkage is 0.6556214794929751.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.6759851614442833.
Cophenetic correlation for Chebyshev distance and average linkage is 0.7695330133310652.
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.7193948146606922.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.7820706370993079.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.6843897408011711.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.8224887488724775.
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.6999145308941309.
Cophenetic correlation for Cityblock distance and single linkage is 0.8060129075261067.
Cophenetic correlation for Cityblock distance and complete linkage is 0.5280471519009835.
Cophenetic correlation for Cityblock distance and average linkage is 0.7915031255045084.
Cophenetic correlation for Cityblock distance and weighted linkage is 0.7649520275546621.
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.8541236643208112, which is obtained with Euclidean distance and average linkage.
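Once a linkage matrix is built (average linkage with Euclidean distance scored best above), scipy's `fcluster` can cut the tree into a fixed number of flat clusters. A sketch on toy blobs standing in for the scaled customer data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# toy data: two well-separated groups
X, _ = make_blobs(n_samples=100, centers=2, random_state=1)

# build the tree with the metric/linkage combination selected above
Z = linkage(X, metric="euclidean", method="average")

# cut the tree into (at most) two flat clusters; labels start at 1
labels = fcluster(Z, t=2, criterion="maxclust")
print("Cluster sizes:", np.bincount(labels)[1:])
```

This is equivalent in spirit to `AgglomerativeClustering(n_clusters=2)` used later, but works directly from the linkage matrix whose cophenetic correlation was just evaluated.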
Question 11: Plot the dendrogram for every linkage method with Euclidean distance only. What should be the appropriate linkage according to the plot?¶
Let's view the dendrograms for the different linkage methods.
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []
# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
# Calculating the linkage with Euclidean distance and the current linkage method
Z = linkage(hc_data, metric="euclidean", method=method)
# Visualizing the Dendrogram with the calculated linkage matrix Z
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
coph_corr, coph_dist = cophenet(Z, pdist(hc_data))
compare.append([method, coph_corr])  # save the result for later comparison
axs[i].annotate(
f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction",
)
Observations:¶
The Ward linkage appears to be the most promising method for clustering the data with Euclidean distance, even though it has the lowest cophenetic correlation (0.47) of the six methods.
Its dendrogram shows a clear separation into clusters with distinct vertical lines, indicating well-defined clusters.
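Once Ward linkage is chosen, flat segment labels can be read off by cutting the dendrogram with SciPy's `fcluster`. This sketch uses two synthetic, well-separated blobs in place of the notebook's scaled `hc_data`.

```python
# Sketch: cutting a Ward dendrogram into flat clusters with fcluster.
# The two Gaussian blobs below stand in for the scaled customer features.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
hc_data = np.vstack([rng.normal(0, 1, (30, 3)), rng.normal(5, 1, (30, 3))])

Z = linkage(hc_data, metric="euclidean", method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters

print(sorted(set(labels)))  # → [1, 2]; fcluster labels start at 1
```

`criterion="maxclust"` asks for at most `t` clusters; an alternative is `criterion="distance"` with a height threshold read off the dendrogram.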
Question 12: Check the silhouette score for the hierarchical clustering. What should be the appropriate number of clusters according to these scores?¶
sil_score_hc = []
cluster_list = list(range(2, 10))
for n_clusters in cluster_list:
# Initialize the model with the current number of clusters from cluster_list
clusterer = AgglomerativeClustering(n_clusters=n_clusters)
# Fit the model on the scaled data (hc_data) and get predictions
preds = clusterer.fit_predict(hc_data)
# Calculate the silhouette score using hc_data and the predictions
score = silhouette_score(hc_data, preds)
sil_score_hc.append(score)
print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
For n_clusters = 2, silhouette score is 0.2403229059475717
For n_clusters = 3, silhouette score is 0.19529185338217378
For n_clusters = 4, silhouette score is 0.20968348277361193
For n_clusters = 5, silhouette score is 0.1263514012994975
For n_clusters = 6, silhouette score is 0.1248134339371427
For n_clusters = 7, silhouette score is 0.12760383576238057
For n_clusters = 8, silhouette score is 0.14095210062230012
For n_clusters = 9, silhouette score is 0.14158008628847135
Observations:¶
- The highest silhouette score is 0.2403, which is achieved when n_clusters = 2. Forming 2 clusters might be the most appropriate choice based on the silhouette score.
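The best cluster count can also be picked programmatically with `argmax` rather than by scanning the printout. A sketch on synthetic data with two well-separated blobs (standing in for `hc_data`):

```python
# Sketch: selecting the cluster count with the highest silhouette score.
# Two well-separated synthetic blobs stand in for the scaled customer data.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (40, 2)), rng.normal(4, 0.5, (40, 2))])

cluster_list = list(range(2, 10))
sil_score_hc = []
for n_clusters in cluster_list:
    preds = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)
    sil_score_hc.append(silhouette_score(X, preds))

best_k = cluster_list[int(np.argmax(sil_score_hc))]
print(best_k)  # with two well-separated blobs, the score peaks at k = 2
```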
Question 13: Fit the Hierarchical clustering model with the appropriate parameters finalized above. How much time does it take to fit the model?¶
%%time
HCmodel = AgglomerativeClustering(n_clusters=2, metric="euclidean", linkage="ward") # Initialize the HC model with appropriate parameters.
HCmodel.fit(hc_data)
CPU times: user 280 ms, sys: 25.9 ms, total: 306 ms
Wall time: 298 ms
AgglomerativeClustering()
Observations:¶
- It takes a wall time of 298 ms (306 ms total CPU time) to fit the model.
# creating a copy of the original data
data2 = data.copy()
# adding hierarchical cluster labels to the original and scaled dataframes
hc_data["HC_segments"] = HCmodel.labels_
data2["HC_segments"] = HCmodel.labels_
hc_data.head()
| Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | Complain | Response | Age | HC_segments | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.235696 | -0.825218 | -0.929894 | 0.307039 | 0.983781 | 1.551577 | 1.679702 | 2.462147 | 1.476500 | 0.843207 | 0.349414 | 1.409304 | 2.510890 | -0.550785 | 0.693904 | -0.097282 | 2.388846 | 0.985345 | 0 |
| 1 | -0.235454 | 1.032559 | 0.906934 | -0.383664 | -0.870479 | -0.636301 | -0.713225 | -0.650449 | -0.631503 | -0.729006 | -0.168236 | -1.110409 | -0.568720 | -1.166125 | -0.130463 | -0.097282 | -0.418612 | 1.235733 | 1 |
| 2 | 0.773999 | -0.825218 | -0.929894 | -0.798086 | 0.362723 | 0.570804 | -0.177032 | 1.345274 | -0.146905 | -0.038766 | -0.685887 | 1.409304 | -0.226541 | 1.295237 | -0.542647 | -0.097282 | -0.418612 | 0.317643 | 0 |
| 3 | -1.022355 | 1.032559 | -0.929894 | -0.798086 | -0.870479 | -0.560857 | -0.651187 | -0.503974 | -0.583043 | -0.748179 | -0.168236 | -0.750450 | -0.910898 | -0.550785 | 0.281720 | -0.097282 | -0.418612 | -1.268149 | 1 |
| 4 | 0.241888 | 1.032559 | -0.929894 | 1.550305 | -0.389085 | 0.419916 | -0.216914 | 0.155164 | -0.001525 | -0.556446 | 1.384715 | 0.329427 | 0.115638 | 0.064556 | -0.130463 | -0.097282 | -0.418612 | -1.017761 | 1 |
data2.head()
| Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | Complain | Response | Age | Labels | HC_segments | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58138.0 | 0 | 0 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | 8 | 10 | 4 | 7 | 0 | 1 | 68 | 0 | 0 |
| 1 | 46344.0 | 1 | 1 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | 1 | 1 | 2 | 5 | 0 | 0 | 71 | 1 | 1 |
| 2 | 71613.0 | 0 | 0 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | 8 | 2 | 10 | 4 | 0 | 0 | 60 | 0 | 0 |
| 3 | 26646.0 | 1 | 0 | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | 2 | 0 | 4 | 6 | 0 | 0 | 41 | 3 | 1 |
| 4 | 58293.0 | 1 | 0 | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | 5 | 3 | 6 | 5 | 0 | 0 | 44 | 5 | 1 |
subset_scaled_data["HC_Clusters"] = HCmodel.labels_
data["HC_Clusters"] = HCmodel.labels_
Cluster Profiling and Comparison¶
K-Means Clustering vs Hierarchical Clustering Comparison¶
Question 14: Perform and compare cluster profiling for both algorithms using boxplots. Based on all the observations, which of them provides better clustering?¶
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster in Kmeans Clustering")
# Get the number of numerical columns to plot
num_cols = len(data1.select_dtypes(include=['number']).columns)
num_rows = (num_cols + 2) // 3 # Calculate rows for subplots
# Iterate over numerical variables and create boxplots
for i, variable in enumerate(data1.select_dtypes(include=['number']).columns):
if variable != "K_means_segments": # Skip the cluster label column itself
plt.subplot(num_rows, 3, i + 1)
# Filter data to include only the desired cluster labels (0 and 1)
filtered_data = data1[data1['K_means_segments'].isin([0, 1])]
sns.boxplot(data=filtered_data, x="K_means_segments", y=variable, palette='Spectral')
plt.tight_layout(pad=2.0)
plt.show()
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster in Hierarchical Clustering")
# Get the number of numerical columns to plot
num_cols = len(data2.select_dtypes(include=['number']).columns)
num_rows = (num_cols + 2) // 3 # Calculate rows for subplots
# Iterate over numerical variables and create boxplots
for i, variable in enumerate(data2.select_dtypes(include=['number']).columns):
if variable != "HC_segments": # Skip the cluster label column itself
plt.subplot(num_rows, 3, i + 1)
# Filter data to include only the desired cluster labels (0 and 1)
filtered_data = data2[data2['HC_segments'].isin([0, 1])]
sns.boxplot(data=filtered_data, x="HC_segments", y=variable, palette='Spectral')
plt.tight_layout(pad=2.0)
plt.show()
Observations:¶
K-Means clustering shows clearer separation between the two clusters for several variables, whereas hierarchical clustering produces overlapping distributions in multiple cases.
K-Means is therefore the preferred choice for segmentation with well-separated clusters.
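One way to make the visual comparison quantitative is the Adjusted Rand Index, which scores agreement between two labelings independently of how the cluster ids are numbered. The toy lists below stand in for `data1["K_means_segments"]` and `data2["HC_segments"]`.

```python
# Sketch: measuring agreement between the K-Means and hierarchical labelings.
# The toy labels are stand-ins for the real segment columns.
from sklearn.metrics import adjusted_rand_score

km_labels = [0, 0, 1, 1, 1, 0, 0, 1]
hc_labels = [1, 1, 0, 0, 0, 1, 1, 0]  # same partition, opposite label names

print(adjusted_rand_score(km_labels, hc_labels))  # → 1.0: identical partitions
```

A score near 1 means the two algorithms carve the customers into essentially the same groups; a low score would indicate genuinely different segmentations and would make the boxplot comparison more consequential.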
Question 15: Perform Cluster profiling on the data with the appropriate algorithm determined above using a barplot. What observations can be derived for each cluster from this plot?¶
plt.figure(figsize=(20, 20))
plt.suptitle("Barplots of all variables for each cluster")
# Filter data to include only the desired cluster labels (0 and 1)
filtered_data = data1[data1['K_means_segments'].isin([0, 1])]
# Get the number of numerical columns to plot
num_cols = len(filtered_data.select_dtypes(include=['number']).columns)
num_rows = (num_cols + 2) // 3 # Calculate rows for subplots
# Iterate over numerical variables and create barplots
for i, variable in enumerate(filtered_data.select_dtypes(include=['number']).columns):
if variable != "K_means_segments": # Skip the cluster label column itself
plt.subplot(num_rows, 3, i + 1)
sns.barplot(data=filtered_data, x="K_means_segments", y=variable, palette='Spectral', errorbar=None)
plt.tight_layout(pad=2.0)
plt.show()
Observations:¶
- Income: Cluster 1 has a much higher average income than Cluster 0 (about USD 71.7K vs. USD 38.8K).
- Recency: average recency is nearly identical across the two clusters (about 49 days), so time since last purchase does little to distinguish the segments.
- Product Spending: Cluster 1 spends far more in every product category, most strikingly on wines and meat products.
- Purchasing Behaviour: Cluster 1 makes more purchases through every channel (web, catalog, and store), while Cluster 0 makes relatively more deal purchases and visits the website more often without buying.
- Kidhome and Teenhome: Cluster 0 households have far more kids at home on average (0.71 vs. 0.06), while Teenhome varies little between the clusters and was likely a weak driver in the cluster formation.
# lets display cluster profile
# Assuming data1 contains cluster labels and relevant features
cluster_profile = data1.groupby('K_means_segments').mean()
# Style the output
cluster_profile.style.highlight_max(color="green", axis=0)
| Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | Complain | Response | Age | Labels | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K_means_segments | |||||||||||||||||||
| 0 | 38840.161643 | 0.706104 | 0.555388 | 48.812359 | 103.334589 | 6.301432 | 36.353429 | 9.343632 | 6.252449 | 22.100226 | 2.560663 | 2.920121 | 0.864356 | 3.914846 | 6.444612 | 0.010550 | 0.096458 | 54.876413 | 3.445365 |
| 1 | 71711.030120 | 0.063527 | 0.434830 | 49.541073 | 595.499452 | 55.372399 | 356.765608 | 78.486309 | 57.309967 | 75.883899 | 1.982475 | 5.777656 | 5.274918 | 8.515882 | 3.676889 | 0.007667 | 0.225630 | 58.109529 | 3.663746 |
# lets display cluster profile
# Assuming data2 contains hierarchical cluster labels and relevant features
cluster_profile = data2.groupby('HC_segments').mean() # Use data2 and 'HC_segments'
# Style the output
cluster_profile.style.highlight_max(color="green", axis=0)
| Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | Complain | Response | Age | Labels | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_segments | |||||||||||||||||||
| 0 | 67501.717257 | 0.132743 | 0.553097 | 48.497345 | 541.935398 | 46.183186 | 301.081416 | 66.417699 | 47.736283 | 70.505310 | 2.571681 | 5.804425 | 4.598230 | 8.072566 | 4.323894 | 0.000000 | 0.202655 | 58.578761 | 3.965487 |
| 1 | 36699.211261 | 0.761261 | 0.458559 | 49.732432 | 61.647748 | 6.063063 | 30.401802 | 8.112613 | 6.017117 | 17.061261 | 2.073874 | 2.334234 | 0.690991 | 3.466667 | 6.327027 | 0.018919 | 0.094595 | 53.766667 | 3.095495 |
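Since both profiles use a two-cluster structure, a cross-tabulation of the two label columns shows directly how customers map between the K-Means and hierarchical segments. The toy frame below stands in for the real `data1["K_means_segments"]` and `data2["HC_segments"]` columns.

```python
# Sketch: cross-tabulating the two segmentations; toy labels stand in for
# the real K_means_segments and HC_segments columns.
import pandas as pd

df = pd.DataFrame({
    "K_means_segments": [0, 0, 1, 1, 0, 1],
    "HC_segments":      [1, 1, 0, 0, 1, 0],
})

# rows = K-Means labels, columns = HC labels; off-diagonal mass shows
# customers assigned to different groups by the two algorithms
print(pd.crosstab(df["K_means_segments"], df["HC_segments"]))
```

If most of the mass sits in one cell per row, the two algorithms agree up to a relabeling, and the profile tables above describe essentially the same customer groups.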
Actionable Insights and Recommendations¶
K-Means Segments
Cluster 0 represents lower-income, budget-conscious shoppers: moderate age (around 55 years), more children at home, and lower spending across all categories. They favor in-store shopping among their channels, browse online frequently but purchase little, and are less responsive to promotions.
Cluster 1 represents wealthier, high-spending customers: fewer children at home, significantly higher spending on all product categories, more catalog and web purchases, greater engagement with promotions, and fewer complaints.
Hierarchical Segments
HC Clustering isolates a more extreme low-spending group
The income gap between clusters is of similar size in both solutions (HC: USD 67.5K vs. USD 36.7K; K-Means: USD 71.7K vs. USD 38.8K).
However, the HC low-income cluster spends markedly less than its K-Means counterpart, so the relative contrast in spending is sharper.
HC Clustering shows more separation in purchase behaviors
The difference in spending on wine, meat, and luxury items is sharper in HC clustering.
HC low-income consumers visit websites more but purchase less.
K-Means has more balanced segmentation
The low-income segment (Cluster 0 in K-Means) still spends on some products, while in HC, they spend significantly less.
Response rates to promotions are strongly polarized in both solutions (K-Means: 22.6% vs. 9.6%; HC: 20.3% vs. 9.5%).
Business Recommendations
- Based on different purchasing behaviours and preferences, marketing campaigns could be targeted and tailored to each segment's needs.
- For high-value customers, implement customer retention strategies to foster long-term relationships with this segment. For example, provide personalized customer service to enhance their experience and encourage repeat orders.
- Enhance the in-store experience for customers who prefer to shop in-person
- For customers who prefer to shop online, invest in online and catalog marketing to reach them and provide a seamless shopping journey across these channels.